Homework 7

Author: Puri Rudick

Cluster the reviews that you collected in homework 5, by doing the following:

1) In Python, select any one of the clustering methods covered in this course. Run it over the collection of reviews, and show at least two different ways of clustering the reviews, e.g., changing k in k-Means clustering or changing where you “cut” in Agnes or Diana.

In Homework 5, I picked five movies in the adventure, fantasy, and superhero genres, all from Marvel Studios. They are commented out with # in the code block below.
However, because those movies are so close to each other in concept, they do not make a good example for this Homework 7.

I decided to pick five new movies in the action genre, with different actors and movie studios.
I obtained each movie's title_id from the IMDB website and put them into a dictionary above.
I obtained the first 250 user reviews for each movie, then combined all of them into a dataframe.
The clustering technique I chose is k-Means clustering.

To decide the value of k, I employed the elbow method, which consists of plotting the distortion (the sum of squared distances from each point to its nearest centroid) as a function of the number of clusters.

Vectorize Reviews using TF-IDF

The elbow method plot above shows distortions that fall along a nearly straight line, with only slight elbows around k = 6, 9, and 16, so these are the values of k used in this homework.
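The distortion values behind an elbow plot can be computed with a sketch like the one below. It assumes scikit-learn's `KMeans`, whose `inertia_` attribute is the sum of squared distances to the nearest centroid. The toy `reviews` list, the range of k, and `random_state=0` are illustrative assumptions, not the values used for the plot above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews; the real run uses the full review dataframe.
reviews = [
    "Great action scenes and a thrilling car chase",
    "Thrilling chase and great stunts",
    "The plot was boring and predictable",
    "Predictable story with a boring villain",
    "Funny dialogue and a hilarious lead actor",
    "The lead actor is hilarious and the jokes land",
    "Terrible pacing and weak special effects",
    "Weak effects and terrible editing ruined it",
]

X = TfidfVectorizer(stop_words="english").fit_transform(reviews)

# Fit one model per candidate k and record its distortion (inertia).
distortions = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions[k] = model.inertia_

for k, d in distortions.items():
    print(k, round(d, 3))
```

Plotting `distortions` against k (for example with matplotlib) gives the elbow curve. Candidate values of k are those where the curve bends and adding more clusters stops reducing the distortion by much.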

Create a function to fit a k-Means model, print the top 10 most common words in each cluster, and plot a word cloud for each cluster.

k = 6

k = 9

k = 16


2) Try to write a short phrase to characterize (give a natural interpretation of) what each cluster is generally centered on semantically. Is this hard to do in some cases? If so, make note of that fact.

Observing the clusters above, we can see that:

With k=6, Cluster 0 contains a mix of all kinds of reviews, while the other clusters group around commonalities in specific movies.

With k=9, Cluster 0 and Cluster 1 still show a mix of reviews, while Cluster 7 centers on positive reviews. The other clusters group around commonalities in specific movies. Cluster 4 and Cluster 8 both center on 'The Man from Toronto' and its main actors.

With k=16, Cluster 0 and Cluster 7 center on positive reviews, while Cluster 11 centers on negative reviews. The other clusters group around commonalities in specific movies. We start to see one movie split across multiple clusters, with no real definition of what each centroid is centered on, which makes those clusters hard to characterize. All of the clusters contain the words film and movie, which could be acting as center points.

3) Explain which of the two clustering results from question 1 is preferable (if one of them is), and why.

Based on Question 2, my preferred clustering is k=9. With the lower k (k=6), we do see clusters around specific movies, but they do not provide enough information. With the higher k (k=16), we see more information, but some clusters become too scattered. In my opinion, k=9 gives an 'about right' answer for this homework.